Everything about Homology Modeling totally explained
In protein structure prediction, homology modeling, also known as comparative modeling, is a class of methods for constructing an atomic-resolution model of a protein from its amino acid sequence (the "query sequence" or "target"). Almost all homology modeling techniques rely on the identification of one or more known protein structures (known as "templates" or "parent structures") likely to resemble the structure of the query sequence, and on the production of an alignment that maps residues in the query sequence to residues in the template sequence. The sequence alignment and template structure are then used to produce a structural model of the target. Because protein structures are more conserved than protein sequences, detectable levels of sequence similarity usually imply significant structural similarity.
The quality of the homology model is dependent on the quality of the sequence alignment and template structure. The approach can be complicated by the presence of alignment gaps (commonly called indels) that indicate a structural region present in the target but not in the template, and by structure gaps in the template that arise from poor resolution in the experimental procedure (usually X-ray crystallography) used to solve the structure. Model quality declines with decreasing sequence identity; a typical model has ~2 Å agreement between the matched Cα atoms at 70% sequence identity but only 4-5 Å agreement at 25% sequence identity. Regions of the model that were constructed without a template, usually by loop modeling, are generally much less accurate than the rest of the model, particularly if the loop is long. Errors in side chain packing and position also increase with decreasing identity, and variations in these packing configurations have been suggested as a major reason for poor model quality at low identity. Taken together, these various atomic-position errors are significant and impede the use of homology models for purposes that require atomic-resolution data, such as drug design and protein-protein interaction predictions; even the quaternary structure of a protein may be difficult to predict from homology models of its subunit(s). Nevertheless, homology models can be useful in reaching qualitative conclusions about the biochemistry of the query sequence, especially in formulating hypotheses about why certain residues are conserved, which may in turn lead to experiments to test those hypotheses. For example, the spatial arrangement of conserved residues may suggest whether a particular residue is conserved to stabilize the folding, to participate in binding some small molecule, or to foster association with another protein or nucleic acid.
Homology modeling can produce high-quality structural models when the target and template are closely related, which has inspired the formation of a structural genomics consortium dedicated to the production of representative experimental structures for all classes of protein folds. The chief inaccuracies in homology modeling, which worsen with lower sequence identity, derive from errors in the initial sequence alignment and from improper template selection. Like other methods of structure prediction, current practice in homology modeling is assessed in a biennial large-scale experiment known as the Critical Assessment of Techniques for Protein Structure Prediction, or CASP.
Motivation
The method of homology modeling is based on the observation that protein tertiary structure is better conserved than amino acid sequence. For two related proteins to adopt different folds, the accumulated mutations would have to cause a massive structural rearrangement; such a rearrangement is unlikely to occur in evolution, especially since the protein is usually under the constraint that it must fold properly and carry out its function in the cell. Consequently, the roughly folded structure of a protein (its "topology") is conserved longer than its amino-acid sequence and much longer than the corresponding DNA sequence; in other words, two proteins may share a similar fold even if their evolutionary relationship is so distant that it cannot be discerned reliably. For comparison, the function of a protein is conserved much less than the protein sequence, since relatively few changes in amino-acid sequence are required to take on a related function.
Steps in model production
The homology modeling procedure can be broken down into four sequential steps: template selection, target-template alignment, model construction, and model assessment. For template selection, one family of sensitive search methods uses "profile-profile" alignments, which first generate a sequence profile of the target and systematically compare it to the sequence profiles of solved structures; the coarse-graining inherent in the profile construction is thought to reduce noise introduced by sequence drift in nonessential regions of the sequence.
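The idea of a sequence profile can be illustrated with a minimal sketch. It assumes a set of already aligned, equal-length sequences and uses plain per-column residue frequencies; real profile methods typically add sequence weighting, pseudocounts and log-odds scoring. The function names `build_profile` and `column_score`, and the dot-product column score, are illustrative assumptions rather than any particular program's method.

```python
from collections import Counter


def build_profile(aligned_seqs, gap="-"):
    """Return per-column residue frequencies for equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(s[col] for s in aligned_seqs if s[col] != gap)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def column_score(p, q):
    """Dot product of two frequency columns (an assumed, simplistic similarity)."""
    return sum(freq * q.get(aa, 0.0) for aa, freq in p.items())


profile = build_profile(["ACD", "ACE", "A-D"])
print(profile[2])  # frequencies of D and E in the last column
```

Comparing columns rather than single residues is what lets a profile tolerate substitutions at positions that vary freely within a family.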
Model generation
Given a template and an alignment, the information contained therein must be used to generate a three-dimensional structural model of the target, represented as a set of Cartesian coordinates for each atom in the protein. Three major classes of model generation methods have been proposed.
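Whatever the method, the alignment decides which target residues have template coordinates to start from. The following sketch shows only that bookkeeping step, under the assumption of one coordinate per template residue (for example its Cα atom); `transfer_coordinates` is a hypothetical helper, not part of any modeling package.

```python
def transfer_coordinates(target_aln, template_aln, template_coords, gap="-"):
    """Map each target residue to the coordinates of its aligned template residue.

    Returns one entry per target residue: a coordinate tuple, or None when the
    residue faces a gap in the template and so has no template coordinates.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    coords_iter = iter(template_coords)
    model = []
    for t_res, p_res in zip(target_aln, template_aln):
        # Every non-gap template column consumes one template coordinate.
        xyz = next(coords_iter) if p_res != gap else None
        if t_res != gap:
            model.append(xyz)
    return model


coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
print(transfer_coordinates("AC-DE", "A-KD-", coords))
```

Entries left as `None` correspond to insertions in the target, which is where template-free building takes over.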
Fragment assembly
The original method of homology modeling relied on the assembly of a complete model from conserved structural fragments identified in closely related solved structures. For example, a modeling study of serine proteases in mammals identified a sharp distinction between "core" structural regions conserved in all experimental structures in the class, and variable regions typically located in the loops where the majority of the sequence differences were localized. Thus unsolved proteins could be modeled by first constructing the conserved core and then substituting variable regions from other proteins in the set of solved structures. Current implementations of this method differ mainly in the way they deal with regions that are not conserved or that lack a template.
Segment matching
The segment-matching method divides the target into a series of short segments, each of which is matched to its own template fitted from the Protein Data Bank. Thus, sequence alignment is done over segments rather than over the entire protein. Selection of the template for each segment is based on sequence similarity, comparisons of alpha carbon coordinates, and predicted steric conflicts arising from the van der Waals radii of the divergent atoms between target and template.
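A steric-conflict check of the kind mentioned above can be sketched as a simple overlap count. The radii are the commonly quoted Bondi values; the 0.4 Å tolerance and the function name `count_clashes` are assumptions for illustration, and the input is assumed to contain only atoms that are not covalently bonded to one another (bonded atoms always sit closer than the sum of their radii).

```python
import math

# Bondi van der Waals radii in angstroms for common protein heavy atoms.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_clashes(atoms, tolerance=0.4):
    """Count atom pairs closer than the sum of their radii minus a tolerance.

    `atoms` is a list of (element, (x, y, z)) tuples for non-bonded atoms only.
    """
    clashes = 0
    for i in range(len(atoms)):
        elem_i, pos_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            elem_j, pos_j = atoms[j]
            limit = VDW_RADII[elem_i] + VDW_RADII[elem_j] - tolerance
            if math.dist(pos_i, pos_j) < limit:
                clashes += 1
    return clashes


atoms = [("C", (0.0, 0.0, 0.0)), ("O", (2.0, 0.0, 0.0)), ("N", (10.0, 0.0, 0.0))]
print(count_clashes(atoms))
```

A candidate segment that produces fewer overlaps with the rest of the model would be preferred, all else being equal.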
Satisfaction of spatial restraints
The most common current homology modeling method takes its inspiration from calculations required to construct a three-dimensional structure from data generated by NMR spectroscopy. One or more target-template alignments are used to construct a set of geometrical criteria that are then converted to probability density functions for each restraint. Restraints applied to the main protein internal coordinates (protein backbone distances and dihedral angles) serve as the basis for a global optimization procedure that originally used conjugate gradient energy minimization to iteratively refine the positions of all heavy atoms in the protein.
This method has been dramatically expanded to apply specifically to loop modeling, which can be extremely difficult due to the high flexibility of loops in proteins in aqueous solution. A more recent expansion applies the spatial-restraint model to electron density maps derived from cryoelectron microscopy studies, which provide low-resolution information that is not usually itself sufficient to generate atomic-resolution structural models. To address the problem of inaccuracies in initial target-template sequence alignment, an iterative procedure has also been introduced to refine the alignment on the basis of the initial structural fit. The most commonly used software in spatial restraint-based modeling is MODELLER, and a database called ModBase has been established for reliable models generated with it.
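The restraint-satisfaction idea can be sketched on a toy scale. The example assumes each restraint is a Gaussian probability density over one interatomic distance, so that the negative log-density is a harmonic penalty, and it minimizes the summed penalty with plain gradient descent rather than the conjugate gradient scheme mentioned above. The names `restraint_penalty` and `gradient_step` are hypothetical, and this is not how MODELLER itself is implemented.

```python
import math


def restraint_penalty(coords, restraints):
    """Sum of negative log Gaussian densities (constants dropped).

    `restraints` holds (i, j, mean, sigma) tuples: atoms i and j should sit
    about `mean` angstroms apart, with uncertainty `sigma`.
    """
    total = 0.0
    for i, j, mean, sigma in restraints:
        d = math.dist(coords[i], coords[j])
        total += (d - mean) ** 2 / (2.0 * sigma ** 2)
    return total


def gradient_step(coords, restraints, step=0.05):
    """Return new coordinates after one plain gradient-descent step."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j, mean, sigma in restraints:
        d = math.dist(coords[i], coords[j])
        if d == 0.0:
            continue  # direction is undefined for coincident atoms
        scale = (d - mean) / (sigma ** 2 * d)
        for k in range(3):
            g = scale * (coords[i][k] - coords[j][k])
            grads[i][k] += g
            grads[j][k] -= g
    return [tuple(c[k] - step * g[k] for k in range(3)) for c, g in zip(coords, grads)]


coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
restraints = [(0, 1, 3.8, 0.5)]  # illustrative values only
for _ in range(50):
    coords = gradient_step(coords, restraints)
print(math.dist(coords[0], coords[1]))  # approaches the restrained distance
```

With many overlapping restraints the penalties compete, which is why a global optimization procedure is needed in practice.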
Loop modeling
Regions of the target sequence that are not aligned to a template are modeled by loop modeling; they are the most susceptible to major modeling errors and occur with higher frequency when the target and template have low sequence identity. The coordinates of unmatched sections determined by loop modeling programs are generally much less accurate than those obtained from simply copying the coordinates of a known structure, particularly if the loop is longer than 10 residues.

The first two side chain dihedral angles (χ1 and χ2) can usually be estimated within 30° for an accurate backbone structure; however, the later dihedral angles found in longer side chains such as lysine and arginine are notoriously difficult to predict. Moreover, small errors in χ1 (and, to a lesser extent, in χ2) can cause relatively large errors in the positions of the atoms at the terminus of the side chain; such atoms often have a functional importance, particularly when located near the active site.
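A dihedral angle such as χ1 is defined by four consecutive atoms, and can be computed from their coordinates as in the sketch below. The helper names are illustrative; `angular_error` wraps the difference between two angles so that a predicted and an observed angle can be compared against a tolerance such as the 30° mentioned above.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def angular_error(predicted, observed):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((predicted - observed + 180.0) % 360.0 - 180.0)


print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
print(angular_error(170.0, -170.0))
```

Because the side chain terminus sits several bonds away from the χ1 axis, a given angular error sweeps those atoms through a larger arc than atoms close to the backbone.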
Model assessment
Assessment of homology models without reference to the true target structure is usually performed with two methods: statistical potentials or physics-based energy calculations. Both methods produce an estimate of the energy (or an energy-like analog) for the model or models being assessed; independent criteria are needed to determine acceptable cutoffs. Neither of the two methods correlates exceptionally well with true structural accuracy, especially on protein types underrepresented in the PDB, such as membrane proteins.
Statistical potentials are empirical methods based on observed residue-residue contact frequencies among proteins of known structure in the PDB. They assign a probability or energy score to each possible pairwise interaction between amino acids and combine these pairwise interaction scores into a single score for the entire model. Some such methods can also produce a residue-by-residue assessment that identifies poorly scoring regions within the model, though the model may have a reasonable score overall. These methods emphasize the hydrophobic core and solvent-exposed polar amino acids often present in globular proteins. Examples of popular statistical potentials include Prosa and DOPE. Statistical potentials are more computationally efficient than energy calculations.
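A toy contact potential shows how observed contact frequencies become pairwise scores. The sketch assumes the common inverse-Boltzmann form, energy = -ln(observed / expected), with a simple pseudocount and a random-mixing reference state; the training contacts are made-up illustrative data, and neither Prosa nor DOPE is implemented this way in detail.

```python
import math
from collections import Counter


def pair_key(a, b):
    """Order-independent key for a residue pair."""
    return tuple(sorted((a, b)))


def derive_potential(contact_lists, pseudocount=1.0):
    """Estimate pair energies as -ln(observed / expected contact frequency)."""
    pair_counts = Counter()
    residue_counts = Counter()
    for contacts in contact_lists:
        for a, b in contacts:
            pair_counts[pair_key(a, b)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_res = sum(residue_counts.values())
    types = sorted(residue_counts)
    n_pair_types = len(types) * (len(types) + 1) // 2
    potential = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            fa = residue_counts[a] / total_res
            fb = residue_counts[b] / total_res
            # Reference state: residues pair at random according to composition.
            expected = fa * fb if a == b else 2.0 * fa * fb
            observed = (pair_counts[(a, b)] + pseudocount) / (
                total_pairs + pseudocount * n_pair_types)
            potential[(a, b)] = -math.log(observed / expected)
    return potential


def score_model(contacts, potential):
    """Sum pair energies over a model's contacts; lower is more native-like."""
    return sum(potential[pair_key(a, b)] for a, b in contacts)


# Hypothetical contacts from two "known structures".
training = [[("L", "V"), ("L", "V"), ("V", "L"), ("K", "E")],
            [("L", "V"), ("K", "E"), ("L", "K")]]
potential = derive_potential(training)
print(score_model([("L", "V"), ("K", "L")], potential))
```

Keeping the per-contact terms instead of only their sum would give the residue-by-residue view described above.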
A very extensive model validation report can be obtained using WHAT_CHECK, which is one option of the WHAT IF software package; it produces a many-page document with extensive analyses of nearly 200 scientific and administrative aspects of the model. WHAT_CHECK can also be used to validate experimentally determined structures of macromolecules.
One newer method for model assessment relies on machine learning techniques such as neural nets, which may be trained to assess the structure directly or to form a consensus among multiple statistical and energy-based methods. Very recent results using support vector machine regression on a jury of more traditional assessment methods outperformed common statistical, energy-based, and machine learning methods.
Structural comparison methods
The assessment of homology models' accuracy is straightforward when the experimental structure is known. The most common method of comparing two protein structures uses the root-mean-square deviation (RMSD) metric to measure the mean distance between the corresponding atoms in the two structures after they have been superimposed. However, RMSD underestimates the accuracy of models in which the core is essentially correctly modeled but some flexible loop regions are inaccurate. A method introduced for the modeling assessment experiment CASP is known as the global distance test (GDT) and measures the total number of atoms whose distance from the model to the experimental structure lies under a certain distance cutoff.
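The difference between the two metrics can be seen in a small sketch. It assumes the two coordinate sets are already superimposed and paired atom by atom, and it computes a GDT-style score from that single superposition using the cutoffs of 1, 2, 4 and 8 Å used for GDT_TS; the full GDT procedure searches over many superpositions, which is not attempted here.

```python
import math


def rmsd(model, reference):
    """RMSD over paired atoms; assumes the structures are already superimposed."""
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))


def gdt_like(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of atoms within each cutoff, for one fixed superposition."""
    dists = [math.dist(a, b) for a, b in zip(model, reference)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Ten atoms on a line; the model is exact except for one atom displaced by 10 A.
reference = [(3.8 * i, 0.0, 0.0) for i in range(10)]
model = list(reference)
model[9] = (3.8 * 9, 10.0, 0.0)
print(rmsd(model, reference), gdt_like(model, reference))
```

A single badly placed atom dominates the RMSD, while the cutoff-based score still reflects that nine of the ten atoms are placed correctly.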
Benchmarking
Several large-scale benchmarking efforts have been made to assess the relative quality of various current homology modeling methods. CASP is a community-wide prediction experiment that runs every two years during the summer months and challenges prediction teams to submit structural models for a number of sequences whose structures have recently been solved experimentally but have not yet been published. Its partner CAFASP has run in parallel with CASP but evaluates only models produced via fully automated servers. Continuously running experiments that do not have prediction "seasons" focus mainly on benchmarking publicly available webservers. LiveBench and EVA run continuously to assess participating servers' performance in prediction of imminently released structures from the PDB. CASP and CAFASP serve mainly as evaluations of the state of the art in modeling, while the continuous assessments seek to evaluate the model quality that would be obtained by a non-expert user employing publicly available tools.
Accuracy
The accuracy of the structures generated by homology modeling is highly dependent on the sequence identity between target and template. Above 50% sequence identity, models tend to be reliable, with only minor errors in side chain packing and rotameric state, and an overall RMSD between the modeled and the experimental structure falling around 1 Å. This error is comparable to the typical resolution of a structure solved by NMR. In the 30-50% identity range, errors can be more severe and are often located in loops. Below 30% identity, serious errors occur, sometimes resulting in the basic fold being mis-predicted.
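The identity ranges above can be turned into a small helper. Sequence identity is computed here over alignment columns where both sequences have a residue, which is one of several conventions in use and is an assumption of this sketch; the function names and the wording of the returned labels are illustrative.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent of aligned (non-gap) column pairs with identical residues."""
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def expected_reliability(identity):
    """Rough qualitative expectation for a model at a given percent identity."""
    if identity > 50.0:
        return "reliable; minor side chain errors expected"
    if identity >= 30.0:
        return "more severe errors possible, often in loops"
    return "serious errors possible, including a mis-predicted fold"


identity = percent_identity("ACDE-F", "ACQEGF")
print(identity, expected_reliability(identity))
```

Such a lookup only summarizes a trend; an individual model can be better or worse than its identity range suggests.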
At high sequence identities, the primary source of error in homology modeling derives from the choice of the template or templates on which the model is based, while lower identities exhibit serious errors in sequence alignment that inhibit the production of high-quality models.
Attempts have been made to improve the accuracy of homology models built with existing methods by subjecting them to molecular dynamics simulation in an effort to improve their RMSD to the experimental structure. However, current force field parameterizations may not be sufficiently accurate for this task, since homology models used as starting structures for molecular dynamics tend to produce slightly worse structures. Slight improvements have been observed in cases where significant restraints were used during the simulation.
Sources of error
The two most common and large-scale sources of error in homology modeling are poor template selection and inaccuracies in target-template sequence alignment. Controlling for these two factors by using a structural alignment, or a sequence alignment produced on the basis of comparing two solved structures, dramatically reduces the errors in final models; these "gold standard" alignments can be used as input to current modeling methods to produce quite accurate reproductions of the original experimental structure.
The rotameric states of side chains and their internal packing arrangement also present difficulties in homology modeling, even in targets for which the backbone structure is relatively easy to predict. This is partly due to the fact that many side chains in crystal structures are not in their "optimal" rotameric state as a result of energetic factors in the hydrophobic core and in the packing of the individual molecules in a protein crystal. One method of addressing this problem requires searching a rotameric library to identify locally low-energy combinations of packing states. It has been suggested that a major reason that homology modeling is so difficult when target-template sequence identity lies below 30% is that such proteins have broadly similar folds but widely divergent side chain packing arrangements.

Even low-accuracy homology models can be useful for reaching qualitative conclusions about a protein's biochemistry, because their inaccuracies tend to be located in the loops on the protein surface, which are normally more variable even between closely related proteins. The functional regions of the protein, especially its active site, tend to be more highly conserved and thus more accurately modeled. Used in conjunction with molecular dynamics simulations, homology models can also generate hypotheses about the kinetics and dynamics of a protein, as in studies of the ion selectivity of a potassium channel. Large-scale automated modeling of all identified protein-coding regions in a genome has been attempted for the yeast Saccharomyces cerevisiae, resulting in nearly 1000 quality models for proteins whose structures had not yet been determined at the time of the study, and identifying novel relationships between 236 yeast proteins and other previously solved structures.